Segmenting functional tissue units across human organs using community-driven development of generalizable machine learning algorithms

The development of a reference atlas of the healthy human body requires automated image segmentation of major anatomical structures across multiple organs based on spatial bioimages generated from various sources with differences in sample preparation. We present the setup and results of the Hacking the Human Body machine learning algorithm development competition hosted by the Human BioMolecular Atlas Program (HuBMAP) and the Human Protein Atlas (HPA) teams on the Kaggle platform. We create a dataset containing 880 histology images with 12,901 segmented structures, engaging 1175 teams from 78 countries in community-driven, open-science development of machine learning models. Tissue variations in the dataset pose a major challenge to the teams, which they overcome by using color normalization techniques and by combining vision transformers with convolutional models. The best model will be productized in the HuBMAP portal to process tissue image datasets at scale in support of Human Reference Atlas construction.

Point-by-Point Response to Reviewer Comments
Reviewer #2 (Remarks to the Author):
First, I would like to thank the authors for their comprehensive response and for addressing some of the raised concerns, in particular the added background on FTUs and the provided reasoning for the scientific and diversity prizes. While the additional information regarding the selection of the Dice metric as the primary evaluation metric is helpful and relatable, the previously raised problems with the chosen metric remain. mAP is the predominant evaluation metric in the object detection domain and has been actively used for many years by several large benchmarks, which shows the metric's broad adoption. Furthermore, other Kaggle challenges have also utilized metrics that quantify detection performance, so it remains unclear whether there might have been a better choice here. While this cannot be changed after concluding the challenge, some kind of discussion or qualitative post-hoc analysis could have been provided in the manuscript.
Authors: Thank you for your expert comments. We have now added further discussion and a post-hoc analysis to the manuscript (see Qualitative Analysis of Predictions in Results and Statistical Analysis in Methods). Specifically, we have added Intersection over Union (IOU), another popular metric for semantic segmentation tasks, as an auxiliary metric and computed it for the top-3 winning teams as well as the top-50 teams to assess how a different metric would have impacted the results. We found that while using IOU leads to some changes in the top-50 team rankings, the top-3 teams rank the same.
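For reference, below is a minimal sketch (an assumption on our part, not the competition's actual scoring code) of the two per-image metrics on binary NumPy masks. Note that the two are monotonically related per image (Dice = 2·IOU / (1 + IOU)), so ranking differences can only emerge once scores are averaged across images.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Per-image Dice and IOU for a pair of binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()           # |P| + |G|
    union = np.logical_or(pred, gt).sum()   # |P union G|
    dice = 2.0 * inter / total if total else 1.0  # both masks empty -> 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)
```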

Nevertheless, since semantic segmentation was chosen as the task formulation of the challenge, an extended analysis should be added to the manuscript to shed some light on the advantages and disadvantages of the top-performing solutions. Some examples are listed below:
* Analysis of individual failure cases, i.e., did all the algorithms fail at the same FTUs? Were there individual images with particularly bad performance, or were only single FTUs missed?
* Additional visualizations of the final results in terms of violin and/or bar plots could be used to provide additional evidence on the previously mentioned point.

Authors:
We have now added Figure 4, which shows violin plots of the mean Dice and mean IOU scores for all three winning teams, broken down by organ. For each violin plot, the individual image scores are overlaid as a swarm plot to highlight the spread and outliers.
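A hypothetical sketch of this figure style is given below; the column names ("organ" and the metric column) are illustrative, not the manuscript's actual data schema.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_per_organ_scores(scores: pd.DataFrame, metric: str = "dice") -> None:
    """Violin plot of per-image scores per organ, with a swarm overlay."""
    ax = sns.violinplot(data=scores, x="organ", y=metric,
                        inner=None, cut=0, color="lightgray")
    sns.swarmplot(data=scores, x="organ", y=metric,
                  size=2, color="black", ax=ax)   # individual image scores
    ax.set_ylabel(f"{metric} score per image")
    plt.tight_layout()
    plt.show()
```

One such plot would be produced per team and per metric.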

* [optional, since this is probably less relevant in the context of FTU segmentation] Boundary-based metrics such as the boundary Dice could be used to analyze the behavior of the algorithms at the boundaries of the FTUs and potentially provide additional insights into algorithmic performance.
Authors: Considering the varying number of instances per image and the presence of touching FTUs, we decided not to compute boundary-based metrics such as the Hausdorff Distance and the Hausdorff Distance at the 95th percentile. We made this choice based on the "Metrics Reloaded" paper (https://arxiv.org/abs/2206.01653) and the detailed rubrics provided therein.
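For future organizers weighing this choice, a minimal sketch of the boundary-Dice idea the reviewer references is shown below (an assumption on our part, not an implementation from the manuscript): extract a tolerance band around each mask's contour and apply the ordinary Dice formula to the bands.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_band(mask: np.ndarray, tolerance: int = 2) -> np.ndarray:
    """Pixels within `tolerance` pixels of the mask boundary."""
    mask = mask.astype(bool)
    edge = mask & ~binary_erosion(mask)               # 1-pixel contour
    return binary_dilation(edge, iterations=tolerance)

def boundary_dice(pred: np.ndarray, gt: np.ndarray, tolerance: int = 2) -> float:
    """Dice restricted to the boundary bands of the two masks."""
    bp = boundary_band(pred, tolerance)
    bg = boundary_band(gt, tolerance)
    denom = bp.sum() + bg.sum()
    return float(2.0 * (bp & bg).sum() / denom) if denom else 1.0
```

As the response notes, touching FTUs make such boundary bands overlap between neighboring instances, which complicates interpretation.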
Without the above-mentioned points, the current manuscript lacks methodological insight for future competition organizers and participants. Furthermore, a post-hoc analysis of the stability of the selected challenge metric (and potentially additional auxiliary metrics) could be conducted to investigate the stability of the ranking. All of this should be backed by commonly used statistical tests.

Authors:
We have added mean IOU scores for the top-50 teams and compared how they would affect the rankings. We found that while using IOU leads to some changes in the top-50 team rankings, the top-3 teams rank the same. Additionally, we have added a study detailing how removing the worst predictions impacts the scores and the rankings. While the scores improve slightly, the rankings stay the same in all cases, except that when removing the worst three cases per organ, Team 3 ranks first based on Dice score. We computed Kendall's Rank Correlation to further investigate the ranking stability and observe a high, though not perfect, correlation between the rankings based on mean Dice score and mean IOU score. These results have been added to the Statistical Analysis section under Methods.
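The correlation check reduces to a single scipy call; the sketch below assumes two parallel lists of per-team mean scores for the top-50 teams (variable names are hypothetical).

```python
from scipy.stats import kendalltau

def ranking_correlation(mean_dice, mean_iou):
    """Kendall's tau between the Dice-based and IOU-based team rankings."""
    # kendalltau compares the score vectors directly; ties are handled
    # internally, so no explicit conversion to ranks is needed.
    tau, p_value = kendalltau(mean_dice, mean_iou)
    return tau, p_value
```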
While the provided ablation experiments in the appendix already offer a small glimpse of the top-performing algorithms, much more detailed information on the employed training strategies should be added to the manuscript (potentially to the supplement). Some examples are given below:
* Detailed information on the ensembles used by each of the three top-performing methods (while the text in the main body gives a rough outline of the models used, more detail in the supplement would be highly appreciated, e.g., which convolutional networks were used inside the ensembles).
* Which external datasets were used by the top-performing methods? This is important information, since external data could be one of the driving factors of winning solutions, and employing more or different external data might be as important as choosing the right model.
* Did all teams use the same pseudo-labeling techniques, or were there differences?

Authors: We have added detailed information on the model architectures, training details, external data, and pseudo-labeling techniques of the three winning teams to the Supplementary Information; a generic sketch of the pseudo-labeling idea follows below.
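This sketch is illustrative only (the winning teams' exact procedures differ and are documented in the Supplementary Information): a model trained on labeled tiles predicts masks for unlabeled tiles, and sufficiently confident predictions are thresholded into pseudo ground truth and folded back into the training set.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_loader, threshold=0.9, min_confident=0.95):
    """Turn confident predictions on unlabeled tiles into extra training pairs."""
    model.eval()
    pseudo = []
    for images in unlabeled_loader:              # batches of unlabeled tiles
        probs = torch.sigmoid(model(images))     # per-pixel foreground probability
        for img, p in zip(images, probs):
            # fraction of pixels the model is confident about, either way
            confident = ((p > threshold) | (p < 1.0 - threshold)).float().mean()
            if confident > min_confident:        # keep only confidently labeled tiles
                pseudo.append((img, (p > 0.5).float()))
    return pseudo                                # appended to the labeled training set
```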

* Ablation experiments of Team 2 are missing and could potentially provide some additional insight into their experiments.
Authors: Unfortunately, since an ablation study is not part of the final submission, not all teams track their experiments. While Team 1 and Team 3 provided their experiments voluntarily, Team 2 did not. We reached out to Team 2, and they informed us that they do not have this information.

* The tables in the appendix are very hard to comprehend and thus require more detailed explanations of the experiments and changes.

Authors:
We have now added further explanation of the ablation studies to the tables in the Supplementary Information.
* Qualitative results from the predictions of the methods should be added to highlight success and failure cases of the methods.

Authors:
We have now added figures of the five best and five worst predictions per organ for all three winning teams to the Supplementary Information.

* [optional] Due to the large number of teams, the detailed analysis could be extended to the top 5 or even top 10 performing teams in order to provide a more comprehensive insight into the methods.
Authors: Since only the winning teams are required to submit detailed documentation as well as the training code of their solutions, most teams that do not win do not provide this information. While some teams may choose to post information in the Discussion forums on the Kaggle competition website, it is generally not very thorough.
In conclusion, the challenge attracted a lot of interest from the community, and a very large number of participants competed, highlighting its impact on the domain and the general interest in FTU segmentation. Unfortunately, the current descriptions of the top-performing methods are not detailed enough, and an extended evaluation is needed to fully leverage the available information. In its current form, the manuscript offers rather shallow methodological insight into the top-performing methods.
Authors: We hope that the changes in the revised manuscript address your concerns about this work.